A Framework for Generating Distributed-Memory Parallel Programs for Block Recursive Algorithms

Authors

  • Sandeep K. S. Gupta
  • Chua-Huang Huang
  • P. Sadayappan
  • Rodney W. Johnson
Abstract

and implement high-performance algorithms to compute the discrete Fourier transform (DFT) [17, 24] and matrix multiplication [14, 15] on shared-memory vector multiprocessors. The significance of the tensor product lies in its ability to model both the computational structures occurring in block recursive algorithms and the underlying hardware structures, such as the interconnection networks [19, 20]. The goal of this paper is to develop a framework for synthesizing efficient distributed-memory programs for block recursive algorithms. In this framework we start with a mathematical specification of the computation using tensor products. We present techniques for developing efficient message-passing codes by analyzing the structure of the input mathematical formulas.

Designing and implementing an algorithm using the tensor product involves viewing the computation as a linear transformation, rewriting the linear transformation as a matrix product of tensor products of smaller computation matrices, recursively applying the rewriting rule to smaller computation matrices, and translating the components of the resulting tensor product formula into vector/parallel operations, iterative loops, and data movement operations. Once expressed using the tensor product notation, several forms of the algorithm with different performance characteristics can be derived by exploiting the algebraic properties of its matrix representation.

In this paper, we provide a framework for designing and implementing programs for distributed-memory MIMD machines. We first present an algebraic representation based on the tensor product for describing the semantics of regular data distributions. Using this algebraic representation, we develop criteria for identifying data distributions which will permit a communication-free implementation of the computation represented by a given tensor product formula. Using these criteria we develop techniques for synthesizing programs for distributed-memory machines with the goal of minimizing the communication overhead. This leads to generation of programs under two different programming models. These two models are based on dif...

Journal of Parallel and Distributed Computing 34, 137–153 (1996), Article No. 0051
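As a rough illustration of how a tensor product factor maps onto loops and data layouts, the following Python sketch evaluates (I_m ⊗ A)x as m independent block operations and (A ⊗ I_m)x as a strided operation. It is not taken from the paper: the function names, the list-of-rows matrix representation, and the pairing of each form with a block or cyclic distribution are assumptions chosen for this example.

```python
def matvec(a, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def kron(a, b):
    """Kronecker (tensor) product of two list-of-rows matrices."""
    return [[aij * bkl for aij in arow for bkl in brow]
            for arow in a for brow in b]


def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def apply_i_tensor_a(m, a, x):
    """Compute (I_m ⊗ A) x as m independent block operations.

    Block k reads only x[k*n:(k+1)*n]. Assumption for illustration: with
    a block distribution (one contiguous block per processor) each
    iteration needs no data from any other processor.
    """
    n = len(a)
    y = []
    for k in range(m):  # iterations are independent of each other
        y.extend(matvec(a, x[k * n:(k + 1) * n]))
    return y


def apply_a_tensor_i(a, m, x):
    """Compute (A ⊗ I_m) x: A acts on elements taken at stride m.

    Slice j gathers x[j], x[j+m], ... Assumption for illustration: a
    cyclic distribution over m processors keeps each slice local, while
    a block distribution would require communication.
    """
    n = len(a)
    y = [0] * (n * m)
    for j in range(m):
        y[j::m] = matvec(a, x[j::m])
    return y


F2 = [[1, 1], [1, -1]]  # 2-point DFT (butterfly) matrix
x = [1, 2, 3, 4]
block_result = apply_i_tensor_a(2, F2, x)
stride_result = apply_a_tensor_i(F2, 2, x)
```

Both results agree with multiplying by the explicitly formed Kronecker product, for example `matvec(kron(identity(2), F2), x)`. The loop forms never build the large matrix, and they make the data access pattern of each factor visible.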


Related resources

An Algebraic Approach to Cache Memory Characterization for Block Recursive Algorithms

Multiprocessor systems usually have cache or local memory in the memory hierarchy. Obtaining good performance on these systems requires that a program utilizes the cache efficiently. In this paper, we address the issue of generating efficient cache based algorithms from tensor product formulas. Tensor product formulas have been used for expressing block recursive algorithms like Strassen's matrix ...


Generating Efficient Programs for Two-Level Memories from Tensor-products

This paper presents a framework for synthesizing efficient out-of-core programs for block recursive algorithms such as the fast Fourier transform (FFT) and Batcher's bitonic sort. The block recursive algorithms considered in this paper are described using tensor (Kronecker) product and other matrix operations. The algebraic properties of the matrix representation are used to derive efficient out-of...


A Programming Methodology for Designing Block Recursive Algorithms

In this paper, we use the tensor product notation as the framework of a programming methodology for designing block recursive algorithms. We first express a computational problem in its matrix form. Next, we formulate a matrix equation for the matrix of the computational problem. Then, we try to find a solution of the matrix equation such that the solution is composed of simple matrices. Finall...


A technique for overlapping computation and communication for block recursive algorithms

This paper presents a design methodology for developing efficient distributed-memory parallel programs for block recursive algorithms such as the fast Fourier transform (FFT) and bitonic sort. This design methodology is specifically suited for most modern supercomputers having a distributed-memory architecture with a circuit-switched or wormhole routed mesh or a hypercube interconnection networ...


Synthesizing Efficient Out-of-core Programs for Block Recursive Algorithms Using Block-cyclic Data Distributions

In this paper, we present a framework for synthesizing I/O efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. Our framework uses an algebraic representation which is based on tensor products and other matrix operations. The programs are optimized for the striped Vitter and Shriver's two-level memory mo...



Journal title:
  • J. Parallel Distrib. Comput.

Volume 34

Pages 137–153

Publication date 1996